the back-propagation equations for a convolutional network

In this post, we'll derive the back-propagation equations for our convolutional net, which has the structure shown below. The square blocks refer to 2D matrices, while the rectangles represent matrices with a single row. They are drawn vertically, however, to make the diagram cleaner.

| matrix | size | meaning |
| --- | --- | --- |
| \(\textbf{k}^i\) | \(1\times n^k\) | i\(^{th}\) kernel |
| \(\textbf{F}\) | \(n^k \times n^0\) | image frame captures |
| \(\textbf{l}^{ci}\) | \(1\times n^0\) | i\(^{th}\) pre-pooled layer |
| \(\textbf{l}^0\) | \(1\times n^0\) | pooled convolution layer |
| \(\textbf{w}^{01}\) | \(n^0 \times n^1\) | weights from layer 0 to 1 |
| \(\textbf{l}^1\) | \(1\times n^1\) | hidden layer |
| \(\textbf{w}^{12}\) | \(n^1 \times n^2\) | weights from layer 1 to 2 |
| \(\textbf{l}^2\) | \(1\times n^2\) | output layer |

In terms of components, the convolutional layer is given by

$$ l_i^{cj}=\sum_m k_m^j F_{mi}, $$

and

$$ l_k^0 = \tanh \left[ \max \left(l_k^{c0},l_k^{c1},...,l_k^{cN}\right) \right]. $$
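As a concrete reference, here is a minimal numpy sketch of the convolution and pooling steps above. The sizes, the random kernels, and the frame matrix are illustrative placeholders rather than anything from a real image; `winner` records which kernel achieves the max at each position, which we will need when back-propagating through the pooling.

```python
import numpy as np

rng = np.random.default_rng(0)

# illustrative sizes: kernel length n^k, number of frames n^0, number of kernels
n_k, n_0, n_kernels = 5, 20, 3
K = rng.standard_normal((n_kernels, n_k))  # row i is the kernel k^i
F = rng.standard_normal((n_k, n_0))        # column p is the p-th image frame

l_c = K @ F                      # l^{cj}_i = sum_m k^j_m F_{mi}, shape (n_kernels, n_0)
winner = np.argmax(l_c, axis=0)  # index of the winning kernel at each position
l0 = np.tanh(l_c.max(axis=0))    # pooled convolution layer l^0, shape (n_0,)
```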

For layers 1 and 2 we have

$$ l_q^1=\tanh \left[\sum_r l_r^0 w_{rq}^{01}\right] $$

and

$$ l_n^2 = \sigma \left[\sum_l l_l^1 w_{ln}^{12} \right], $$

where \(\sigma\) is the softmax function. The loss function is

$$ \epsilon = \frac{1}{2}\sum_p \left(l_p^2-y_p \right)^2, $$

and by applying the chain rule to these equations we will derive the back-propagation equations.
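Continuing the sketch, the dense layers and the loss follow directly from the last three equations; the weights and the one-hot target `y` are again arbitrary placeholders.

```python
n_1, n_2 = 10, 4                       # illustrative hidden and output sizes
w01 = rng.standard_normal((n_0, n_1))  # weights from layer 0 to 1
w12 = rng.standard_normal((n_1, n_2))  # weights from layer 1 to 2
y = np.eye(n_2)[1]                     # an arbitrary one-hot target

l1 = np.tanh(l0 @ w01)                 # hidden layer l^1
z = l1 @ w12
l2 = np.exp(z) / np.exp(z).sum()       # softmax output layer l^2
eps = 0.5 * np.sum((l2 - y) ** 2)      # quadratic loss
```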

As with the 3-layer non-convolutional net, we have

$$ \frac{\partial \epsilon}{\partial w_{qs}^{12}}=\left[\textbf{l}^1\otimes\left(\textbf{l}^2-\textbf{y}\right)\textbf{D}_{\sigma'}\right]_{qs} $$

and

$$ \frac{\partial \epsilon}{\partial w_{qs}^{01}}=\left[\textbf{l}^0 \otimes \left(\textbf{l}^2-\textbf{y}\right)\textbf{D}_{\sigma'}\textbf{w}^{12,T}\textbf{D}_{t'}^1\right]_{qs}. $$
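In code, these two gradients are a pair of outer products. In the sketch below, \(\sigma_m'\) is taken to be \(l_m^2\left(1-l_m^2\right)\), the diagonal of the softmax Jacobian, which is the interpretation of \(\textbf{D}_{\sigma'}\) assumed here; everything else follows the two equations directly.

```python
d_sigma = l2 * (1.0 - l2)         # assumed diagonal softmax derivative sigma'_m
d_t1 = 1.0 - l1 ** 2              # tanh derivative t^{1'} at the hidden layer

delta2 = (l2 - y) * d_sigma       # (l^2 - y) D_{sigma'}
grad_w12 = np.outer(l1, delta2)   # d(eps)/d(w^{12}), shape (n_0 -> n_1, n_2) entries [qs]
delta1 = (delta2 @ w12.T) * d_t1  # (l^2 - y) D_{sigma'} w^{12,T} D_{t'}^1
grad_w01 = np.outer(l0, delta1)   # d(eps)/d(w^{01}), shape (n_0, n_1)
```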

But now, in addition to these, we have

$$ \frac{\partial \epsilon}{\partial k_q^i}=\sum_{n,m,p}\frac{\partial \epsilon}{\partial l_m^2}\frac{\partial l_m^2}{\partial l_n^1}\frac{\partial l_n^1}{\partial l_p^0}\frac{\partial l_p^0}{\partial k_q^i}. $$

These derivatives are

$$ \frac{\partial \epsilon}{\partial l_m^2}=\sum_p \left(l_p^2-y_p\right)\delta_{pm}=\left(l_m^2-y_m\right), $$
$$ \frac{\partial l_m^2}{\partial l_n^1}=\sigma_m'\sum_l w_{lm}^{12}\delta_{ln}=\sigma_m' w_{nm}^{12}, $$
$$ \frac{\partial l_n^1}{\partial l_p^0}=t_n^{1'}\sum_r \delta_{pr}w_{rn}^{01}=t_n^{1'}w_{pn}^{01}, $$

and

$$ \frac{\partial l_p^0}{\partial k_q^i}=\frac{\partial}{\partial k_q^i}\{t\left[\max(l_p^{c0},l_p^{c1},...,l_p^{cN})\right]\}=t_p^{0'}\frac{\partial l_p^{c,w(p)}}{\partial k_q^i}, $$

where \(w(p)\) is the index of the largest \(l_p^{ci}\) ("w" is for "winner"). This last derivative is given by

$$ \frac{\partial l_p^{c,w(p)}}{\partial k_q^i}=\frac{\partial}{\partial k_q^i}\left[\sum_m k_m^{w(p)}F_{mp}\right]=\sum_m F_{mp}\delta^{i,w(p)}\delta_{mq}=F_{qp}\delta^{i,w(p)}. $$

Defining

$$ \mathcal{F}_{qp}^i\equiv F_{qp}\delta^{i,w(p)}, $$

this result can be written more compactly as

$$ \frac{\partial l_p^{c,w(p)}}{\partial k_q^i}=\mathcal{F}_{qp}^i, $$

so that

$$ \frac{\partial l_p^0}{\partial k_q^i}=t_p^{0'}\mathcal{F}_{qp}^i. $$

Hence, for the components of the gradient corresponding to the kernels, we obtain

$$ \frac{\partial \epsilon}{\partial k_{q}^i}=\sum_{n,m,p}\left(l_m^2-y_m\right)\sigma_m' w_{mn}^{12,T}t_n^{1'} w_{np}^{01,T}t_p^{0'}\mathcal{F}_{pq}^{i,T}, $$

which can be written as

$$ \frac{\partial \epsilon}{\partial k_{q}^i}=\left[\left(\textbf{l}^2-\textbf{y}\right)\textbf{D}_{\sigma'}\textbf{w}^{12,T}\textbf{D}_{t'}^1\textbf{w}^{01,T}\textbf{D}_{t'}^0 \mathcal{F}^{i,T} \right]_q. $$
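Putting the pieces together, a sketch of the kernel gradient only needs the `winner` indices saved during pooling to build the masked frame matrices \(\mathcal{F}^i\).

```python
d_t0 = 1.0 - l0 ** 2              # tanh derivative t^{0'} at the pooled layer
delta0 = (delta1 @ w01.T) * d_t0  # (l^2 - y) D_{sigma'} w^{12,T} D_{t'}^1 w^{01,T} D_{t'}^0

grad_K = np.zeros_like(K)
for i in range(n_kernels):
    F_i = F * (winner == i)       # curly-F^i: zero the frames where kernel i did not win
    grad_K[i] = delta0 @ F_i.T    # [delta0 curly-F^{i,T}]_q = d(eps)/d(k^i_q)
```

The explicit loop over kernels keeps the correspondence with \(\mathcal{F}^i\) visible; it could just as well be replaced by a single einsum over the winner mask.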
